Skip to main content

Massive Upgrade! Apache Doris 1.2.0 is Released

Release Notes

We are excited to announce the release of Apache Doris 1.2.0 on December 7, 2022! We take this opportunity to extend our heartfelt appreciation to the 118 contributors who have made over 2400 commits to the open source Apache Doris project.

Under the LTS version management mechanism, the previous Apache Doris 1.1.X versions have focused on bug fixing and stability guarantee. Now, it is time to provide users with a new taste. Apache Doris 1.2.0 is a comprehensively upgraded version with a surprising number of new features that can meet the needs of general users and developers.

In Apache Doris 1.2.0, we provide:

  • Full Vectorized-Engine support that brings greatly improved performance
  • Merge-on-Write support on Unique Key Model, which reduces query overhead and largely improves the reading efficiency on the updateable data model
  • Multi Catalog that enables Doris to quickly access external data sources including Hive Metastore, Elasticsearch, and JDBC
  • Light Schema Change, alllowing millisecond-level Schema Change operations and DML and DDL synchronization from upstream databases to Doris via Flink CDC
  • JDBC Facade in replacement of ODBC Facade
  • Support for Java UDF, Remote UDF, Array data type, and JSONB data type

Download & Install

Download from GitHub: https://github.com/apache/doris/releases

Download from official website: https://doris.apache.org/download

Source code: https://github.com/apache/doris/releases/tag/1.2.0-rc04

Downloading Instructions

Due to the file size limit of the Apache server, the Doris 1.2.0 binary program is divided into 3 packets, which you may download from the Doris website:

  • apache-doris-fe

  • apache-doris-be

  • apache-doris-java-udf-jar-with-dependencies

    apache-doris-java-udf-jar-with-dependencies is added to support the new JDBC external tables and Java UDF in Doris 1.2.0. After downloading, please find the java-udf-jar-with-dependencies.jar file in this packet and put it under the be/lib directory. This is necessary for the startup of BE.

For your convenience, we offer the precompiled binary program packets of Apache Doris 1.2.0 with a built-in Java 8 runtime environment. Click the links below and start downloading immediately:

Deployment Instructions

To upgrade to Apache Doris 1.2.0, you need to update the bin and lib directories under fe and be.

For more details, you may refer to:

You can always contact SelectDB engineers for help with version upgrading, feature verification, and production.

Highlights

Full Vectorized-Engine Support

All system modules are vectorized in Apache Doris 1.2.0, including Data Loading and Unloading, Schema Change, Compaction, and UDFs. The new vectorized engine can perfectly work in replacement of the old non-vectorized engine, so we are considering removing the code of the latter in future versions.

Besides vectorization, we have conducted full-link optimization covering data scanning, predicate computing, Aggregation ops, HashJoin ops, and Shuffle between ops. All this results in much faster performance.

Tests on various standard test sets show that Apache Doris 1.2.0 runs nearly 4 times as fast as 1.1.3 and 10 times as 0.15.0 in SSB-100-Flat scenarios. While in complex TPCH-100 scenarios, it reports 3-time and 11-time performance enhancements against 1.1.3 and 0.15.0, respectively.

Figure 1. SSB

Figure 2. TPC-H

After submitting the test results to ClickBench, an internationally recognized benchmark for databases, we proudly found out that Apache Doris 1.2.0 won second place in Cold Run and third place in Hot Run for the universal machine (c6a.4xlarge, 500gb gp2) in terms of query performance, with 8 SQLs being record-breakers. It tops the list of writing performance under the same machine model as it is able to write 70G of data within merely 415s and achieves a single-node write speed of over 170 MB/s.

Figure 3. Cold Run

Figure 4. Hot Run

Merge-on-Write on Unique Key Model

In older versions, Apache Doris realizes real-time update of data with the Unique Key data model. However, the Merge-on-Read implementation of this method can incur problems, including efficiency bottlenecks, unnecessary computing resource consumption and IO overheads, and jitters in queries.

That's why we added the Merge-on-Write data update mode. It marks the data that needs to be deleted or updated when the data is written, as a way to ensure the uniqueness of primary key in data writing. This has saved the trouble of deduplicating the primary keys in data reading, and thus reducing extra resource consumption in queries. In addition, this method supports predicate pushdown. It can make good use of the various indexing in Doris to performs data cropping on the IO level, reducing the data processing volume in reading and computing. This is how it brings significant performance increases in multiple query scenarios.

Simulation tests of various continuous data loading scenarios on the SSB-Flat dataset report 3~6 times performance improvements in most queries.

  • Scenarios: It is suitable for users with requirements of primary key uniqueness and frequent needs of real-time Upsert updates.
  • Usage: As a new feature, the Merge-on-Write implementation is disabled by default. You may enable it by adding the following Property to the CREATE TABLE statement:
“enable_unique_key_merge_on_write” = “true”

Due to the difference between versions, existing Unique Key tables do not support Merge-on-Write. Adding Property to the ALTER TABLE statement will not do the trick. You may only enable this implementation for newly created tables. You can use insert into new_table select * from old_table to convert an old table to a new one.

Multi Catalog

The Multi Catalog feature is to make it easier for users to interface with external data catalogs, and thus enhance Apache Doris' capabilities in Data Lake analysis and federated data query.

The previous versions only allows interfacing with external data sources at the Database or Table level. Hence, if the external data catalogs are of very different schema, or if they contain too many Databases or Tables, users will have to do manual mapping one by one, which is maintenance-intensive.

With the Multi Catalog feature in 1.2.0, you can connect to external data sources via the CREATE CATALOG command, and Doris will automatically map the external libraries and tables. After that, you can access the external data as easily as accessing internal data.

Currently, Multi Catalog supports the following data sources:

  • Hive Metastore: You can access data tables including Hive, Iceberg, Hudi, and also dock to data sources compatible with Hive Metastore, such as Aliyun's DataLake Formation. It also supports data access on HDFS and object stores.
  • Elasticsearch: You can access ES data sources.
  • JDBC: You can access MySQL data sources via JDBC.

Note: The corresponding privilege levels will be changed automatically. See the Notes for Version Upgrade section for details.

Documentation:

https://doris.apache.org/docs/dev/ecosystem/external-table/multi-catalog

Light Schema Change

Schema Change used to be a burdenous job in Apache Doris. It involved modification of data files that largely reduced efficiency under large cluster size and large data volume in tables. Plus, any changes in the upstream schema would require a suspension of data synchronization for manual Schema Change operation. This could increase development and maintenance costs and induce backlog of consumption data.

In the new version, it is no longer necessary to change the data file synchronously when adding and subtracting columns to the data table. You only need to update the metadata in FE**. This enables millisecond-level Schema Change operation.** Users can use Flink CDC to realize DML and DDL synchronization from upstream database to Doris.

As a new feature, Light Schema Change is disabled by default. You may enable it by adding the following Property to the CREATE TABLE statement:

"light_schema_change"="true"

Documentation:

https://doris.apache.org/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE

JDBC Facade

In older versions of Apache Doris, users used ODBC external tables to access data sources such as MySQL, Oracle, SQL Server, PostgreSQL, but since different ODBC driver versions may cause system instability, we decided to replace ODBC external tables with JDBC external tables in 1.2.0, as JDBC interfaces are more unified and support diverse databases. External data sources that support JDBC protocol can be connected to Apache Doris 1.2.0. Currently, we support the following sources and we are making continuous efforts to expand the list:

  • MySQL
  • PostgreSQL
  • Oracle
  • SQLServer
  • ClickHouse

The ODBC feature will be removed in a later version. Please try to switch to the JDBC.

Documentation:

https://doris.apache.org/docs/dev/ecosystem/external-table/jdbc-of-doris/

Java UDF

Apache Doris 1.2.0 supports writing UDF/UDAF in Java in addition to C++, so users can use custom functions in the Java ecosystem. Meanwhile, through technologies such as off-heap memory and Zero Copy, the efficiency of cross-language data access has been greatly improved in 1.2.0.

Documentation:

https://doris.apache.org/docs/dev/ecosystem/udf/java-user-defined-function

Example: https://github.com/apache/doris/tree/master/samples/doris-demo

Remote UDF

The new version supports accessing remote UDF services through RPC, thus completely eliminating language restrictions for users to write UDFs. Users can use any programming language to implement custom functions and complete complex data analysis work.

Documentation:

https://doris.apache.org/docs/ecosystem/udf/remote-user-defined-function

Example: https://github.com/apache/doris/tree/master/samples/doris-demo

More Data Type Support

  1. Array type

The new version supports the Array type and nested Array types since the Array type is more applicable in scenarios such as user portraits and tags. We have also implemented a large number of array-related functions to better support the application of the Array type in actual usage.

Documentation:

https://doris.apache.org/docs/dev/sql-manual/sql-reference/Data-Types/ARRAY

Related functions:

https://doris.apache.org/docs/dev/sql-manual/sql-functions/array-functions/array_max

  1. JSONB type

The new version supports the JSONB type, which provides a more compact encoding format and data access in the encoding format. Compared with JSON data stored in strings, the use of the JSONB type can increase performance by several times.

Documentation: https://doris.apache.org/docs/dev/sql-manual/sql-reference/Data-Types/JSONB

Related functions: https://doris.apache.org/docs/dev/sql-manual/sql-functions/json-functions/jsonb_parse

Date V2/DatatimeV2

The new version supports Date V2 and DatatimeV2 data types, which are more efficient and support micro-second-level time accuracy.

Documentation:

Sphere of influence:

  1. Users need to specify DateV2 and DatetimeV2 when creating tables. The Date and Datetime of the existing tables will not be changed.
  2. If calculations are conducted between data of the new and old types (for example, equi-join), the old data types will be cast into the corresponding new types.
  3. See examples in the related documentation

New Memory Management Framework

In Apache Doris 1.2.0, we added a new Memory Tracker to record Doris BE process memory usage, including memory and cache space usage in queries, data loading, Compaction, and Schema Change. The Memory Tracker enables more fine-grained memory monitoring and control, which greatly reduces OOM problems caused by memory overruns and further improves system stability.

Documentation:

https://doris.apache.org/docs/dev/admin-manual/maint-monitor/memory-management/memory-tracker

Table Valued Function

Apache Doris 1.2.0 implements a set of Table Valued Function (TVF). TVF can be regarded as an ordinary table, which can appear in all places where "table" can appear in SQL.

For example, we can use S3 TVF to implement data import on object storage:

insert into tbl select * from s3("s3://bucket/file.*", "ak" = "xx", "sk" = "xxx") where c1 > 2;

Or directly query data files on HDFS:

insert into tbl select * from hdfs("hdfs://bucket/file.*") where c1 > 2;

TVF can help users make full use of the rich expressiveness of SQL and flexibly process various data.

Documentation:

https://doris.apache.org/docs/dev/sql-manual/sql-functions/table-functions/s3

https://doris.apache.org/docs/dev/sql-manual/sql-functions/table-functions/hdfs

More

1. A More Convenient Way to Create Partitions

Apache Doris 1.2.0 supports creating multiple partitions within a certain time range via the FROM TOcommand.

Documentation:

https://doris.apache.org/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE

Example:

-- Create partitions based on DATE, supports the mixed use of batch create and single create
PARTITION BY RANGE(event_day)(
FROM ("2000-11-14") TO ("2021-11-14") INTERVAL 1 YEAR,
FROM ("2021-11-14") TO ("2022-11-14") INTERVAL 1 MONTH,
FROM ("2022-11-14") TO ("2023-01-03") INTERVAL 1 WEEK,
FROM ("2023-01-03") TO ("2023-01-14") INTERVAL 1 DAY,
PARTITION p_20230114 VALUES [('2023-01-14'), ('2023-01-15'))
)

-- Create partitions based on DATETIME
PARTITION BY RANGE(event_time)(
FROM ("2023-01-03 12") TO ("2023-01-14 22") INTERVAL 1 HOUR
)

2. Column Renaming

Apache Doris 1.2.0 supports column renaming for tables with Light Schema Change enabled.

Documentation:

https://doris.apache.org/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Alter/ALTER-TABLE-RENAME

3. More Fine-Grained Privilege Management

5. Support Viewing the Catalog Recycle Bin

Apache Doris 1.2.0 supports viewing the catalog recycle bin through the show catalog recycle bin function.

Documentation:

https://doris.apache.org/docs/dev/sql-manual/sql-reference/Show-Statements/SHOW-CATALOG-RECYCLE-BIN

6. Support SELECT * EXCEPT syntax

Documentation:

https://doris.apache.org/docs/dev/data-table/basic-usage

7. Outfile Supports ORC Format Export and Multi-Byte Delimiters

Documentation:

https://doris.apache.org/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/OUTFILE

8. Support to Modify the Number of Query Profiles That Can Be Saved Through Configuration

Search FE configuration item in the documentation: max_query_profile_num

9. DELETE Statement supports IN Predicate Conditions and Partition Pruning

Documentation:

https://doris.apache.org/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/DELETE

10. Default Value of the Time Column Supporting Current_timestamp

Search "CURRENT_TIMESTAMP" in the documentation: https://doris.apache.org/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE

11. Added Two System Tables: Backends, Rowsets

Documentation:

https://doris.apache.org/docs/dev/admin-manual/system-table/backends

https://doris.apache.org/docs/dev/admin-manual/system-table/rowsets

12. Backup and Restore

13. Support Data Balance Between Multiple Disks on the Same Machine

Documentation:

14. Routine Load supports subscribing to Kerberos-authenticated Kafka services

Search for kerberos in the documentation: https://doris.apache.org/docs/dev/data-operate/import/import-way/routine-load-manual

15. New Built-in Functions

Added the following built-in functions:

  • cbrt
  • sequence_match/sequence_count
  • mask/mask_first_n/mask_last_n
  • elt
  • any/any_value
  • group_bitmap_xor
  • ntile
  • nvl
  • uuid
  • initcap
  • regexp_replace_one/regexp_extract_all
  • multi_search_all_positions/multi_match_any
  • domain/domain_without_www/protocol
  • running_difference
  • bitmap_hash64
  • murmur_hash3_64
  • to_monday
  • not_null_or_empty
  • window_funnel
  • group_bit_and/group_bit_or/group_bit_xor
  • outer combine
  • and all array functions

Upgrade Notice

Known Issues

  • Use JDK11 will cause BE crash, please use JDK8 instead.

Behavior Changed

  • Permission level changes

    Because the catalog level is introduced, the corresponding user permission level will also be changed automatically. The rules are as follows:

    • GlobalPrivs and ResourcePrivs remain unchanged
    • Added CatalogPrivs level.
    • The original DatabasePrivs level is added with the internal prefix (indicating the db in the internal catalog)
    • Add the internal prefix to the original TablePrivs level (representing tbl in the internal catalog)
  • In GroupBy and Having clauses, match on column names in preference to aliases. (#14408)

  • Creating columns starting with mv_ is no longer supported. mv_ is a reserved keyword in materialized views (#14361)

  • Removed the default limit of 65535 rows added by the order by statement, and added the session variable default_order_by_limit to configure this limit. (#12478)

  • In the table generated by "Create Table As Select", all string columns use the string type uniformly, and no longer distinguish varchar/char/string (#14382)

  • In the audit log, remove the word default_cluster before the db and user names. (#13499) (#11408)

  • Add sql digest field in audit log (#8919)

  • The union clause always changes the order by logic. In the new version, the order by clause will be executed after the union is executed, unless explicitly associated by parentheses. (#9745)

  • During the decommission operation, the tablet in the recycle bin will be ignored to ensure that the decomission can be completed. (#14028)

  • The returned result of Decimal will be displayed according to the precision declared in the original column, or according to the precision specified in the cast function. (#13437)

  • Changed column name length limit from 64 to 256 (#14671)

  • Changes to FE configuration items

    • The enable_vectorized_load parameter is enabled by default. (#11833)
    • Increased create_table_timeout value. The default timeout for table creation operations will be increased. (#13520)
    • Modify stream_load_default_timeout_second default value to 3 days.
    • Modify the default value of alter_table_timeout_second to one month.
    • Increase the parameter max_replica_count_when_schema_change to limit the number of replicas involved in the alter job, the default is 100000. (#12850)
    • Add disable_iceberg_hudi_table. The iceberg and hudi appearances are disabled by default, and the multi catalog function is recommended. (#13932)
  • Changes to BE configuration items

    • Removed disable_stream_load_2pc parameter. 2PC's stream load can be used directly. (#13520)
    • Modify tablet_rowset_stale_sweep_time_sec from 1800 seconds to 300 seconds.
    • Redesigned configuration item name about compaction (#13495)
    • Revisited parameter about memory optimization (#13781)
  • Session variable changes

    • Modify the variable enable_insert_strict to true by default. This will cause some insert operations that could be executed before, but inserted illegal values, to no longer be executed. (11866)

    • Modified variable enable_local_exchange to default to true (#13292)

    • Default data transmission via lz4 compression, controlled by variable fragment_transmission_compression_codec (#11955)

    • Add skip_storage_engine_merge variable for debugging unique or agg model data (#11952)

      Documentation: https://doris.apache.org/docs/dev/advanced/variables

  • The BE startup script will check whether the value is greater than 200W through /proc/sys/vm/max_map_count. Otherwise, the startup fails. (#11052)

  • Removed mini load interface (#10520)

  • FE Metadata Version

    FE Meta Version changed from 107 to 114, and cannot be rolled back after upgrading.

During Upgrade

  1. Upgrade preparation

    • Need to replace: lib, bin directory (start/stop scripts have been modified)
    • BE also needs to configure JAVA_HOME, and already supports JDBC Table and Java UDF.
    • The default JVM Xmx parameter in fe.conf is changed to 8GB.
  2. Possible errors during the upgrade process

    • The repeat function cannot be used and an error is reported: vectorized repeat function cannot be executed, you can turn off the vectorized execution engine before upgrading. (#13868)
    • schema change fails with error: desc_tbl is not set. Maybe the FE version is not equal to the BE (#13822)
    • Vectorized hash join cannot be used and an error will be reported. vectorized hash join cannot be executed. You can turn off the vectorized execution engine before upgrading. (#13753)

    The above errors will return to normal after a full upgrade.

Performance Impact

  • By default, JeMalloc is used as the memory allocator of the new version BE, replacing TcMalloc (#13367)
  • The batch size in the tablet sink is modified to be at least 8K. (#13912)
  • Disable chunk allocator by default (#13285)

API change